Below is the literature review on Bayesian networks for root cause analysis:
| Citation | Method | Output | Conclusions |
|---|---|---|---|
| TroubleMiner: Mining network trouble tickets. Medem, A.; Akodjenou, M.-I.; Teixeira, R. 2009 | Trouble ticket classification | * Automated clustering of the free text in trouble ticket descriptions. * Selection of the correct keywords for the analysis | Uses term-frequency distance between trouble tickets and similarity between clusters |
| Knowledge Discovery from Trouble Ticketing Reports in a Large Telecommunication Company. Temprado, Y.; Garcia, C.; Molinero, F.J. 2009 | Data mining, text mining and machine learning; Bayes net, naïve Bayes | Prediction of the next action on a trouble ticket; different snapshots were added to the machine learning algorithm for training | Combines multiple methods to construct the recommendation; uses Bayesian methods for prediction |
| A Bayesian Approach To Stochastic Root Finding 2011 | in progress | in progress | in progress |
| A Fully Bayesian Approach For Unit Root Testing 2011 | in progress | in progress | in progress |
| Online Root-Cause Analysis Of Alarms In Discrete Bayesian 2014 | in progress | in progress | in progress |
| Documents Categorization Based On Bayesian Spanning Tree 2006 | in progress | in progress | in progress |
| Benefits of a Bayesian Approach to Anomaly and Failure 2009 | in progress | in progress | in progress |
The following rules define the selection criteria for the trouble ticket dataset:

| Rules | Description |
|---|---|
| status = ‘Closed’ | The ticket must be closed so that its information is complete |
| network_tt_id is NULL | The ticket must not be related to a network trouble ticket |
| trouble ticket type <> PASSIVE | The ticket must relate to active elements such as routers, switches, modems, etc. |
| installed_date is NOT NULL | This field must have a value |
| created_date is NOT NULL | This field must have a value |
| closed_date is NOT NULL | This field must have a value |
| product is NOT NULL | This field must have a value |
| sub_product is NOT NULL | This field must have a value |
| length(description) > 10 | The description must be long enough to be useful for text analysis |
| rand() | Records are selected at random |
| zone | Records should be selected from different zones (sparse) |
For sampling purposes, the dataset is drawn from ZONE KEPONG, because this zone has the highest record count in the trouble ticket dataset.
Documentation - https://github.com/piersharding/dplyrimpaladb
install.packages(c("RJDBC", "devtools", "dplyr"))
# dplyrimpaladb is installed from GitHub (see the documentation link above)
devtools::install_github("piersharding/dplyrimpaladb")
Basic Impala drivers can be downloaded from https://github.com/Mu-Sigma/RImpala/blob/master/impala-jdbc-cdh5.zip
Below are the components required and how to set the class path for the Impala drivers, rJava, RJDBC and dplyr.
setwd("D:/Google Drive/PHD/Progress/phdprogress1")
suppressWarnings(suppressMessages(library("rJava")))
suppressWarnings(suppressMessages(library("RJDBC")))
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("caret")))
suppressWarnings(suppressMessages(library("corrplot")))
suppressWarnings(suppressMessages(library("lazy")))
suppressWarnings(suppressMessages(library("dplyrimpaladb")))
suppressWarnings(suppressMessages(library("rpart")))
suppressWarnings(suppressMessages(library("DiagrammeR")))
suppressWarnings(suppressMessages(library("klaR")))
suppressWarnings(suppressMessages(library("readxl")))
suppressWarnings(suppressMessages(library("tools")))
suppressWarnings(suppressMessages(library("Rgraphviz")))
suppressWarnings(suppressMessages(library("SnowballC")))
suppressWarnings(suppressMessages(library("tm")))
#add the class path for Impala jar
.jaddClassPath(c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T)))
#add the class path for JVM
.jinit(classpath = c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T)))
#add the class path for dplyr library
dplyr.jdbc.classpath = c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T))
conn <- src_impaladb(dbname='nova', host='10.54.1.151')
## Loading required package: testthat
## [1] "here:"
## [1] FALSE
result <- tbl(conn, sql("select zone from nova.nova_trouble_ticket where zone <> 'null' group by zone order by zone limit 1000"))
as.data.frame(result)
## zone
## 1 ZONE AIR ITAM
## 2 ZONE BANGI
## 3 ZONE BANGSAR
## 4 ZONE BANTING
## 5 ZONE BATU
## 6 ZONE BATU PAHAT
## 7 ZONE BAYAN BARU
## 8 ZONE BINTULU
## 9 ZONE BUKIT ANGGERIK
## 10 ZONE BUKIT MERTAJAM
## 11 ZONE BUKIT RAJA
## 12 ZONE BUTTERWORTH
## 13 ZONE CYBERJAYA
## 14 ZONE GOMBAK
## 15 ZONE IPOH
## 16 ZONE KAJANG
## 17 ZONE KEPONG
## 18 ZONE KERAMAT
## 19 ZONE KINRARA
## 20 ZONE KL CENTRAL
## 21 ZONE KLANG
## 22 ZONE KOTA KINABALU SELATAN
## 23 ZONE KOTA KINABALU UTARA
## 24 ZONE KUCHING
## 25 ZONE KULIM
## 26 ZONE LANGKAWI
## 27 ZONE MALURI
## 28 ZONE MELAKA UTARA
## 29 ZONE MIRI
## 30 ZONE N. SEMBILAN UTARA
## 31 ZONE PANDAN
## 32 ZONE PELANGI
## 33 ZONE PERLIS
## 34 ZONE PETALING JAYA
## 35 ZONE PUCHONG
## 36 ZONE SEBERANG JAYA
## 37 ZONE SENAI
## 38 ZONE SG PETANI
## 39 ZONE SHAH ALAM
## 40 ZONE SIBU
## 41 ZONE SKUDAI PONTIAN
## 42 ZONE STAMPIN
## 43 ZONE SUBANG JAYA
## 44 ZONE TAMAN PETALING
## 45 ZONE TAMPOI
## 46 ZONE TAR
## 47 ZONE TASEK
## 48 ZONE TASIK AMPANG
## 49 ZONE TDI
## 50 ZONE TELUK INTAN
## 51 ZONE TERENGGANU SELATAN
## 52 ZONE TERUNTUM
So, I can replace the SQL pattern `like '%ZONE __________ %'` with each zone from the list above, apply `limit 100` to each zone group, and combine the groups via a UNION ALL operation.
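A minimal base-R sketch of how the per-zone queries could be assembled and combined with UNION ALL (the zone vector is abbreviated here; the real list has 52 zones):

```r
# Sketch: build one "limit 100" query per zone and join them with UNION ALL.
zones <- c("ZONE AIR ITAM", "ZONE BANGI", "ZONE KEPONG")
per_zone <- sprintf(
  "select * from nova_trouble_ticket where zone = '%s' limit 100",
  zones)
union_sql <- paste(per_zone, collapse = " union all ")
```

The resulting string could then be passed to `tbl(conn, sql(union_sql))` in the same way as the queries shown here.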
For example, to inspect the available columns and their R classes:
result <- tbl(conn, sql("select * from nova_trouble_ticket where zone <> 'null' limit 1"))
as.data.frame(apply(as.data.frame(result),2,class))
## apply(as.data.frame(result), 2, class)
## tt_row_id character
## tt_num character
## tt_type character
## tt_sub_type character
## status character
## severity character
## important_message character
## appointment_flag character
## nova_account_name character
## nova_subscriber_num character
## nova_account_num character
## package_row_id character
## created_by character
## category character
## symptom_error_code character
## priority character
## product character
## sub_product character
## package_name character
## network_tt_id character
## swap_order_num character
## cause_category character
## cause_code character
## resolution_code character
## closure_category character
## resolution_team character
## service_affected character
## service_order_num character
## btu_type character
## owner character
## owner_name character
## group_owner character
## owner_position character
## btu_platform character
## dp_location character
## created_date character
## pending_verify_date character
## closed_by character
## closed_date character
## source character
## installed_date character
## description character
## repeat_ticket_count character
## follow_up_ticket_count character
## fdp_device_name character
## fdp_site_name character
## olt_site_name character
## exchange character
## timestamp character
## contact_id character
## contact_name character
## contact_office_phone character
## contact_mobile_phone character
## contact_home_phone character
## contact_email_addr character
## due_date character
## part_num character
## network_layer character
## network_row_id character
## asset_id character
## ptt character
## zone character
## service_point_id character
Sample dataset: trouble tickets are selected only from Zone Kepong. Zone Kepong contains very rich information, especially for textual analysis, and has one of the largest compositions of cause codes and resolution codes, which is good for supervised learning. The SQL is defined by the rules below:
| Rules | Description |
|---|---|
| a.status like ‘%Closed%’ | The ticket must be closed so that its information is complete |
| network_tt_id = ‘null’ | The ticket must not be related to a network trouble ticket |
| trouble ticket type <> PASSIVE | The ticket must relate to active elements such as routers, switches, modems, etc. Tickets related to third-party causes, customer behaviour and passive elements are excluded for now |
| installed_date is NOT NULL | This field must have a value |
| created_date is NOT NULL | This field must have a value |
| closed_date is NOT NULL | This field must have a value |
| product is NOT NULL | This field must have a value |
| sub_product is NOT NULL | This field must have a value |
| length(description) > 10 | The description must be long enough to be useful for text analysis |
| rand() | Records are selected at random |
| zone | Records should be selected from different zones (sparse control) |
Generated SQL:

select * from nova_trouble_ticket a join active_code b on (trim(a.cause_code) = trim(b.cause_code)) join exchange_zone c ON (trim(a.exchange)=trim(c.building_id)) and (b.code <> 'PASSIVE' ) where c.zone_name like '%ZONE KEPONG%' and a.status like '%Closed%' and length(a.cause_category) > 1 and length(a.created_date) > 6 and length(a.closed_date) > 6 and length(a.installed_date) > 6 and a.package_name not like '%null%' and a.product not like '%null%' and a.sub_product not like '%null%' and length(a.description) > 10 and network_tt_id = 'null' order by rand() limit 10000
Remove non-related fields such as the trouble ticket key, trouble ticket number, trouble ticket dates, etc.
conn <- src_impaladb(dbname='nova', host='10.54.1.151')
## [1] "here:"
## [1] FALSE
result <- tbl(conn, sql("select a.tt_row_id,a.tt_num,a.tt_type,a.tt_sub_type,a.status,a.severity,a.important_message,a.appointment_flag,a.nova_account_name,a.nova_subscriber_num,a.nova_account_num,a.package_row_id,a.created_by,a.category,a.symptom_error_code,a.priority,a.product,a.sub_product,a.package_name,a.network_tt_id,a.swap_order_num,a.cause_category,a.cause_code,a.resolution_code,a.closure_category,a.resolution_team,a.service_affected,a.service_order_num,a.btu_type,a.owner,a.owner_name,a.group_owner,a.owner_position,a.btu_platform,a.dp_location,a.created_date,a.pending_verify_date,a.closed_by,a.closed_date,a.source,a.installed_date,a.description,a.repeat_ticket_count,a.follow_up_ticket_count,a.fdp_device_name,
a.fdp_site_name,a.olt_site_name,a.exchange,a.`timestamp`,a.contact_id,a.contact_name,a.contact_office_phone,a.contact_mobile_phone,a.contact_home_phone,a.contact_email_addr,a.due_date,a.part_num,a.network_layer,a.network_row_id,a.asset_id,a.ptt,a.zone,a.service_point_id , c.zone_name, c.district,c.state, c.region from nova_trouble_ticket a join active_code b on (trim(a.cause_code) = trim(b.cause_code)) join exchange_zone c ON (trim(a.exchange)=trim(c.building_id)) and (b.code <> 'PASSIVE' ) where c.zone_name like '%ZONE KEPONG%' and a.status like '%Closed%' and length(a.cause_category) > 1 and length(a.created_date) > 6 and length(a.closed_date) > 6 and length(a.installed_date) > 6 and a.package_name not like '%null%' and a.product not like '%null%' and a.sub_product not like '%null%' and length(a.description) > 10 and a.network_tt_id = 'null' order by rand() limit 100"))
result <- as.data.frame(result)
Close the connection to Impala.
x <- conn$con
class(x) <- c('JDBCConnection')
dbDisconnect(x)
## [1] TRUE
Save the result as a data.frame; a second copy (`df_tm`) is kept unmodified.
df <- as.data.frame(result)
df_tm <- as.data.frame(result)
df$contact_name <- NULL
df$contact_home_phone <- NULL
df$contact_email_addr <- NULL
df$contact_office_phone <- NULL
df$contact_mobile_phone <- NULL
df$`tt_row_id` <- NULL
df$`tt_num` <- NULL
df$tt_type <- NULL
df$`created_date` <- NULL
df$`closed_date` <- NULL
df$`installed_date` <- NULL
df$timestamp <- NULL
df$service_point_id <- NULL
df$contact_id <- NULL
df$owner_position <- NULL
df$tt_sub_type <- NULL
df$severity <- NULL
df$status <- NULL
df$important_message <- NULL
df$network_tt_id <- NULL
df$swap_order_num <- NULL
df$appointment_flag <- NULL
df$nova_account_name <- NULL
df$nova_subscriber_num <- NULL
df$nova_account_num <- NULL
df$repeat_ticket_count <- NULL
df$follow_up_ticket_count <- NULL
df$service_order_num <- NULL
df$source <- NULL
df$owner_name <- NULL
df$description <- NULL
df$due_date <- NULL
df$part_num <- NULL
df$zone <- NULL
df$ptt <- NULL
df$asset_id <- NULL
df$network_layer <- NULL
df$network_row_id <- NULL
df$pending_verify_date <- NULL
df$package_row_id <- NULL
df$priority <- NULL
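The long run of `df$x <- NULL` assignments above can be collapsed into a single vectorised drop. A sketch on a toy frame (the drop list is abbreviated here):

```r
# Sketch: drop identifier/contact columns in one step rather than one
# `df$x <- NULL` assignment per column.
drop_cols <- c("tt_row_id", "tt_num", "contact_name")
df_demo <- data.frame(tt_row_id    = 1:3,
                      tt_num       = 4:6,
                      contact_name = c("a", "b", "c"),
                      product      = c("p1", "p2", "p3"),
                      stringsAsFactors = FALSE)
df_demo <- df_demo[, setdiff(names(df_demo), drop_cols), drop = FALSE]
names(df_demo)  # "product"
```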
summary(df)
## created_by category symptom_error_code
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## product sub_product package_name
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## cause_category cause_code resolution_code
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## closure_category resolution_team service_affected
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## btu_type owner group_owner
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## btu_platform dp_location closed_by
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## fdp_device_name fdp_site_name olt_site_name
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## exchange zone_name district
## Length:100 Length:100 Length:100
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## state region
## Length:100 Length:100
## Class :character Class :character
## Mode :character Mode :character
Loop over the column names, append a numeric factor encoding of each column (named `[column name]_factor`), and print each processed column name:
for(i in names(df)){
num <- as.numeric(as.factor(df[,i]))
df <- cbind(df,num)
names(df)[names(df)=="num"] <- paste(names(df[i]),"_factor",sep = "")
print(paste(names(df[i]),"1",sep = ""))
}
## [1] "created_by1"
## [1] "category1"
## [1] "symptom_error_code1"
## [1] "product1"
## [1] "sub_product1"
## [1] "package_name1"
## [1] "cause_category1"
## [1] "cause_code1"
## [1] "resolution_code1"
## [1] "closure_category1"
## [1] "resolution_team1"
## [1] "service_affected1"
## [1] "btu_type1"
## [1] "owner1"
## [1] "group_owner1"
## [1] "btu_platform1"
## [1] "dp_location1"
## [1] "closed_by1"
## [1] "fdp_device_name1"
## [1] "fdp_site_name1"
## [1] "olt_site_name1"
## [1] "exchange1"
## [1] "zone_name1"
## [1] "district1"
## [1] "state1"
## [1] "region1"
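The loop above grows `df` one column at a time; an equivalent vectorised encoding can be written in one step. A sketch on toy data, using the same `as.numeric(as.factor(.))` transformation:

```r
# Sketch: numeric factor encoding of every column at once.
df_demo <- data.frame(product = c("HSI", "Voice", "HSI"),
                      zone    = c("KEPONG", "KEPONG", "IPOH"),
                      stringsAsFactors = FALSE)
df_factor <- as.data.frame(lapply(df_demo,
                                  function(x) as.numeric(as.factor(x))))
names(df_factor) <- paste0(names(df_demo), "_factor")
```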
df <- df[27:52]
names(df)
## [1] "created_by_factor" "category_factor"
## [3] "symptom_error_code_factor" "product_factor"
## [5] "sub_product_factor" "package_name_factor"
## [7] "cause_category_factor" "cause_code_factor"
## [9] "resolution_code_factor" "closure_category_factor"
## [11] "resolution_team_factor" "service_affected_factor"
## [13] "btu_type_factor" "owner_factor"
## [15] "group_owner_factor" "btu_platform_factor"
## [17] "dp_location_factor" "closed_by_factor"
## [19] "fdp_device_name_factor" "fdp_site_name_factor"
## [21] "olt_site_name_factor" "exchange_factor"
## [23] "zone_name_factor" "district_factor"
## [25] "state_factor" "region_factor"
write.csv(df,"ctt2014_small.csv")
Remove predictor columns with (near-)zero variance; a column with a single unique value carries no information and leads to zero-variance results. The variables remaining after removal are listed below.
df <- df[,-nearZeroVar(df)]
names(df)
## [1] "created_by_factor" "category_factor"
## [3] "symptom_error_code_factor" "product_factor"
## [5] "sub_product_factor" "package_name_factor"
## [7] "cause_category_factor" "cause_code_factor"
## [9] "resolution_code_factor" "closure_category_factor"
## [11] "resolution_team_factor" "service_affected_factor"
## [13] "btu_type_factor" "owner_factor"
## [15] "group_owner_factor" "btu_platform_factor"
## [17] "dp_location_factor" "closed_by_factor"
## [19] "fdp_device_name_factor" "fdp_site_name_factor"
## [21] "olt_site_name_factor" "exchange_factor"
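The zero-variance case that this step guards against can be illustrated in base R (a sketch of the strict zero-variance case only; `caret::nearZeroVar` additionally flags near-zero variance using frequency ratios):

```r
# Sketch: a constant column has zero variance and carries no information
# for the model, so it is dropped.
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(5, 5, 5, 5))   # constant -> zero variance
zero_var <- names(toy)[vapply(toy, function(x) var(x) == 0, logical(1))]
toy <- toy[, setdiff(names(toy), zero_var), drop = FALSE]
names(toy)  # "a"
```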
Compute the pairwise Pearson correlation between the variables.
correlations <- cor(df, use="pairwise.complete.obs", method="pearson")
print(correlations)
## created_by_factor category_factor
## created_by_factor 1.00000000 -0.072387248
## category_factor -0.07238725 1.000000000
## symptom_error_code_factor -0.08267702 0.198715902
## product_factor 0.08388301 -0.044170248
## sub_product_factor 0.16617625 -0.087225808
## package_name_factor 0.06271467 -0.104937159
## cause_category_factor 0.08414186 0.053469506
## cause_code_factor 0.06781792 0.072915126
## resolution_code_factor -0.01756382 -0.040664360
## closure_category_factor 0.05410537 0.103440224
## resolution_team_factor 0.09611271 0.214571696
## service_affected_factor -0.10211774 -0.294634711
## btu_type_factor -0.08708122 -0.014813596
## owner_factor 0.43534980 0.046706493
## group_owner_factor 0.01388123 0.049131139
## btu_platform_factor -0.03920239 -0.095018399
## dp_location_factor 0.01464646 0.045696501
## closed_by_factor 0.43534980 0.046706493
## fdp_device_name_factor -0.03085109 0.034875988
## fdp_site_name_factor -0.03085109 0.034875988
## olt_site_name_factor -0.01589588 0.006246004
## exchange_factor 0.03843539 0.025839067
## symptom_error_code_factor product_factor
## created_by_factor -0.082677023 0.08388301
## category_factor 0.198715902 -0.04417025
## symptom_error_code_factor 1.000000000 -0.10651881
## product_factor -0.106518810 1.00000000
## sub_product_factor -0.048073165 0.48755359
## package_name_factor 0.015390256 0.77553621
## cause_category_factor 0.158916426 0.24381425
## cause_code_factor 0.136987911 0.06411406
## resolution_code_factor 0.062227127 0.28855270
## closure_category_factor 0.005429765 -0.03349907
## resolution_team_factor 0.101226596 -0.37627774
## service_affected_factor -0.317962939 0.23751093
## btu_type_factor -0.079434062 0.17131098
## owner_factor 0.005034200 -0.02332783
## group_owner_factor 0.106279328 -0.04467218
## btu_platform_factor -0.031844469 0.10889355
## dp_location_factor 0.139383400 -0.03037303
## closed_by_factor 0.005034200 -0.02332783
## fdp_device_name_factor -0.073277163 -0.02430041
## fdp_site_name_factor -0.073277163 -0.02430041
## olt_site_name_factor -0.094243710 -0.04138103
## exchange_factor -0.051399544 -0.01354695
## sub_product_factor package_name_factor
## created_by_factor 0.16617625 0.062714675
## category_factor -0.08722581 -0.104937159
## symptom_error_code_factor -0.04807316 0.015390256
## product_factor 0.48755359 0.775536214
## sub_product_factor 1.00000000 0.555410598
## package_name_factor 0.55541060 1.000000000
## cause_category_factor 0.09546730 0.146596285
## cause_code_factor 0.21221574 0.093513229
## resolution_code_factor 0.11958663 0.168443826
## closure_category_factor -0.01441844 -0.061428430
## resolution_team_factor -0.18675861 -0.295632612
## service_affected_factor 0.24294257 0.238479552
## btu_type_factor 0.08918906 0.014639816
## owner_factor 0.03674080 0.011394674
## group_owner_factor -0.03216436 0.059038196
## btu_platform_factor 0.08599936 -0.003912656
## dp_location_factor 0.04786905 0.068812495
## closed_by_factor 0.03674080 0.011394674
## fdp_device_name_factor -0.09447411 -0.164135482
## fdp_site_name_factor -0.09447411 -0.164135482
## olt_site_name_factor -0.10305364 -0.202252716
## exchange_factor -0.10932264 -0.122840263
## cause_category_factor cause_code_factor
## created_by_factor 0.08414186 0.06781792
## category_factor 0.05346951 0.07291513
## symptom_error_code_factor 0.15891643 0.13698791
## product_factor 0.24381425 0.06411406
## sub_product_factor 0.09546730 0.21221574
## package_name_factor 0.14659629 0.09351323
## cause_category_factor 1.00000000 0.25932650
## cause_code_factor 0.25932650 1.00000000
## resolution_code_factor 0.46757026 0.15279525
## closure_category_factor 0.13686841 -0.11892871
## resolution_team_factor -0.27080474 0.14141374
## service_affected_factor -0.01872455 0.01142710
## btu_type_factor 0.06395550 -0.05762476
## owner_factor -0.16523259 -0.09099995
## group_owner_factor -0.13980399 -0.12887935
## btu_platform_factor 0.06837131 -0.07330032
## dp_location_factor 0.02200759 0.17676213
## closed_by_factor -0.16523259 -0.09099995
## fdp_device_name_factor -0.02086297 -0.28877825
## fdp_site_name_factor -0.02086297 -0.28877825
## olt_site_name_factor 0.01073973 -0.30538266
## exchange_factor -0.01147051 -0.32079033
## resolution_code_factor closure_category_factor
## created_by_factor -0.01756382 0.054105367
## category_factor -0.04066436 0.103440224
## symptom_error_code_factor 0.06222713 0.005429765
## product_factor 0.28855270 -0.033499075
## sub_product_factor 0.11958663 -0.014418442
## package_name_factor 0.16844383 -0.061428430
## cause_category_factor 0.46757026 0.136868408
## cause_code_factor 0.15279525 -0.118928712
## resolution_code_factor 1.00000000 0.144926338
## closure_category_factor 0.14492634 1.000000000
## resolution_team_factor -0.41490201 -0.054166643
## service_affected_factor -0.01306866 0.104292947
## btu_type_factor 0.33161471 0.054083570
## owner_factor -0.18849713 -0.108159548
## group_owner_factor -0.24600919 -0.097840859
## btu_platform_factor 0.32934291 0.073878259
## dp_location_factor -0.24313781 -0.078841974
## closed_by_factor -0.18849713 -0.108159548
## fdp_device_name_factor 0.22182602 0.027996173
## fdp_site_name_factor 0.22182602 0.027996173
## olt_site_name_factor 0.25645181 0.054411002
## exchange_factor 0.15430579 0.015553757
## resolution_team_factor service_affected_factor
## created_by_factor 0.09611271 -0.10211774
## category_factor 0.21457170 -0.29463471
## symptom_error_code_factor 0.10122660 -0.31796294
## product_factor -0.37627774 0.23751093
## sub_product_factor -0.18675861 0.24294257
## package_name_factor -0.29563261 0.23847955
## cause_category_factor -0.27080474 -0.01872455
## cause_code_factor 0.14141374 0.01142710
## resolution_code_factor -0.41490201 -0.01306866
## closure_category_factor -0.05416664 0.10429295
## resolution_team_factor 1.00000000 -0.13246490
## service_affected_factor -0.13246490 1.00000000
## btu_type_factor -0.25822644 0.06582288
## owner_factor 0.27837402 -0.19032833
## group_owner_factor -0.01349542 -0.28777200
## btu_platform_factor -0.21900417 0.11727936
## dp_location_factor 0.16254032 -0.09467019
## closed_by_factor 0.27837402 -0.19032833
## fdp_device_name_factor -0.11517543 0.05452589
## fdp_site_name_factor -0.11517543 0.05452589
## olt_site_name_factor -0.10949318 0.04547937
## exchange_factor -0.03080287 0.11172718
## btu_type_factor owner_factor group_owner_factor
## created_by_factor -0.08708122 0.43534980 0.01388123
## category_factor -0.01481360 0.04670649 0.04913114
## symptom_error_code_factor -0.07943406 0.00503420 0.10627933
## product_factor 0.17131098 -0.02332783 -0.04467218
## sub_product_factor 0.08918906 0.03674080 -0.03216436
## package_name_factor 0.01463982 0.01139467 0.05903820
## cause_category_factor 0.06395550 -0.16523259 -0.13980399
## cause_code_factor -0.05762476 -0.09099995 -0.12887935
## resolution_code_factor 0.33161471 -0.18849713 -0.24600919
## closure_category_factor 0.05408357 -0.10815955 -0.09784086
## resolution_team_factor -0.25822644 0.27837402 -0.01349542
## service_affected_factor 0.06582288 -0.19032833 -0.28777200
## btu_type_factor 1.00000000 -0.15504445 -0.12921915
## owner_factor -0.15504445 1.00000000 0.33030464
## group_owner_factor -0.12921915 0.33030464 1.00000000
## btu_platform_factor 0.91632426 -0.18329232 -0.14965284
## dp_location_factor -0.77537561 0.12603192 0.14968938
## closed_by_factor -0.15504445 1.00000000 0.33030464
## fdp_device_name_factor 0.68689996 -0.12146031 -0.07982665
## fdp_site_name_factor 0.68689996 -0.12146031 -0.07982665
## olt_site_name_factor 0.67620677 -0.11011242 -0.09849533
## exchange_factor 0.39386981 -0.07144686 -0.09452025
## btu_platform_factor dp_location_factor
## created_by_factor -0.039202385 0.01464646
## category_factor -0.095018399 0.04569650
## symptom_error_code_factor -0.031844469 0.13938340
## product_factor 0.108893546 -0.03037303
## sub_product_factor 0.085999357 0.04786905
## package_name_factor -0.003912656 0.06881250
## cause_category_factor 0.068371308 0.02200759
## cause_code_factor -0.073300324 0.17676213
## resolution_code_factor 0.329342909 -0.24313781
## closure_category_factor 0.073878259 -0.07884197
## resolution_team_factor -0.219004166 0.16254032
## service_affected_factor 0.117279362 -0.09467019
## btu_type_factor 0.916324258 -0.77537561
## owner_factor -0.183292319 0.12603192
## group_owner_factor -0.149652836 0.14968938
## btu_platform_factor 1.000000000 -0.79164019
## dp_location_factor -0.791640195 1.00000000
## closed_by_factor -0.183292319 0.12603192
## fdp_device_name_factor 0.668271342 -0.65681126
## fdp_site_name_factor 0.668271342 -0.65681126
## olt_site_name_factor 0.670017850 -0.62280188
## exchange_factor 0.421064534 -0.37695976
## closed_by_factor fdp_device_name_factor
## created_by_factor 0.43534980 -0.03085109
## category_factor 0.04670649 0.03487599
## symptom_error_code_factor 0.00503420 -0.07327716
## product_factor -0.02332783 -0.02430041
## sub_product_factor 0.03674080 -0.09447411
## package_name_factor 0.01139467 -0.16413548
## cause_category_factor -0.16523259 -0.02086297
## cause_code_factor -0.09099995 -0.28877825
## resolution_code_factor -0.18849713 0.22182602
## closure_category_factor -0.10815955 0.02799617
## resolution_team_factor 0.27837402 -0.11517543
## service_affected_factor -0.19032833 0.05452589
## btu_type_factor -0.15504445 0.68689996
## owner_factor 1.00000000 -0.12146031
## group_owner_factor 0.33030464 -0.07982665
## btu_platform_factor -0.18329232 0.66827134
## dp_location_factor 0.12603192 -0.65681126
## closed_by_factor 1.00000000 -0.12146031
## fdp_device_name_factor -0.12146031 1.00000000
## fdp_site_name_factor -0.12146031 1.00000000
## olt_site_name_factor -0.11011242 0.97273379
## exchange_factor -0.07144686 0.88203830
## fdp_site_name_factor olt_site_name_factor
## created_by_factor -0.03085109 -0.015895876
## category_factor 0.03487599 0.006246004
## symptom_error_code_factor -0.07327716 -0.094243710
## product_factor -0.02430041 -0.041381032
## sub_product_factor -0.09447411 -0.103053644
## package_name_factor -0.16413548 -0.202252716
## cause_category_factor -0.02086297 0.010739731
## cause_code_factor -0.28877825 -0.305382658
## resolution_code_factor 0.22182602 0.256451813
## closure_category_factor 0.02799617 0.054411002
## resolution_team_factor -0.11517543 -0.109493184
## service_affected_factor 0.05452589 0.045479369
## btu_type_factor 0.68689996 0.676206768
## owner_factor -0.12146031 -0.110112425
## group_owner_factor -0.07982665 -0.098495332
## btu_platform_factor 0.66827134 0.670017850
## dp_location_factor -0.65681126 -0.622801882
## closed_by_factor -0.12146031 -0.110112425
## fdp_device_name_factor 1.00000000 0.972733794
## fdp_site_name_factor 1.00000000 0.972733794
## olt_site_name_factor 0.97273379 1.000000000
## exchange_factor 0.88203830 0.868256058
## exchange_factor
## created_by_factor 0.03843539
## category_factor 0.02583907
## symptom_error_code_factor -0.05139954
## product_factor -0.01354695
## sub_product_factor -0.10932264
## package_name_factor -0.12284026
## cause_category_factor -0.01147051
## cause_code_factor -0.32079033
## resolution_code_factor 0.15430579
## closure_category_factor 0.01555376
## resolution_team_factor -0.03080287
## service_affected_factor 0.11172718
## btu_type_factor 0.39386981
## owner_factor -0.07144686
## group_owner_factor -0.09452025
## btu_platform_factor 0.42106453
## dp_location_factor -0.37695976
## closed_by_factor -0.07144686
## fdp_device_name_factor 0.88203830
## fdp_site_name_factor 0.88203830
## olt_site_name_factor 0.86825606
## exchange_factor 1.00000000
Find the most highly correlated variables, using the following rule of thumb for correlation strength:
| Correlation | Description |
|---|---|
| ±0.70 or higher | Very strong relationship |
| ±0.40 to ±0.69 | Strong relationship |
| ±0.30 to ±0.39 | Moderate relationship |
| ±0.20 to ±0.29 | Weak relationship |
| ±0.01 to ±0.19 | Negligible or no relationship |
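The bands above can be expressed as a small helper for labelling correlations (a sketch; the `cor_strength` function and its labels are illustrative, with thresholds taken from the table):

```r
# Sketch: map an absolute correlation to the strength bands tabled above.
cor_strength <- function(r) {
  a <- abs(r)
  if      (a >= 0.70) "very strong"
  else if (a >= 0.40) "strong"
  else if (a >= 0.30) "moderate"
  else if (a >= 0.20) "weak"
  else                "negligible"
}
cor_strength(0.916)   # btu_type_factor vs btu_platform_factor
cor_strength(-0.415)  # resolution_code_factor vs resolution_team_factor
```

For instance, the 0.916 correlation between `btu_type_factor` and `btu_platform_factor` falls in the very strong band, which is why a 0.7 cutoff flags it below.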
# Choose a 0.7 cutoff (very strong relationship)
highlyCorrelated <- findCorrelation(correlations, 0.7 ,verbose = FALSE,names = TRUE)
highlyCorrelated
## [1] "olt_site_name_factor" "fdp_device_name_factor"
## [3] "fdp_site_name_factor" "btu_platform_factor"
## [5] "btu_type_factor" "owner_factor"
## [7] "package_name_factor"
Summary of the correlation matrix.
summary(correlations)
## created_by_factor category_factor symptom_error_code_factor
## Min. :-0.10212 Min. :-0.29463 Min. :-0.317963
## 1st Qu.:-0.03085 1st Qu.:-0.04329 1st Qu.:-0.073277
## Median : 0.02654 Median : 0.03488 Median : 0.005034
## Mean : 0.09427 Mean : 0.05361 Mean : 0.044360
## 3rd Qu.: 0.08408 3rd Qu.: 0.05238 3rd Qu.: 0.105016
## Max. : 1.00000 Max. : 1.00000 Max. : 1.000000
## product_factor sub_product_factor package_name_factor
## Min. :-0.37628 Min. :-0.18676 Min. :-0.29563
## 1st Qu.:-0.03272 1st Qu.:-0.07744 1st Qu.:-0.09406
## Median :-0.01844 Median : 0.04230 Median : 0.01502
## Mean : 0.12161 Mean : 0.10936 Mean : 0.09555
## 3rd Qu.: 0.22096 3rd Qu.: 0.15453 3rd Qu.: 0.13333
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## cause_category_factor cause_code_factor resolution_code_factor
## Min. :-0.27080 Min. :-0.32079 Min. :-0.41490
## 1st Qu.:-0.02033 1st Qu.:-0.11195 1st Qu.:-0.03489
## Median : 0.05871 Median : 0.03777 Median : 0.14886
## Mean : 0.09083 Mean : 0.02840 Mean : 0.11669
## 3rd Qu.: 0.14416 3rd Qu.: 0.14031 3rd Qu.: 0.24780
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## closure_category_factor resolution_team_factor service_affected_factor
## Min. :-0.11893 Min. :-0.41490 Min. :-0.31796
## 1st Qu.:-0.05961 1st Qu.:-0.21094 1st Qu.:-0.12488
## Median : 0.02177 Median :-0.08183 Median : 0.02845
## Mean : 0.05125 Mean :-0.01453 Mean : 0.02918
## 3rd Qu.: 0.06901 3rd Qu.: 0.13137 3rd Qu.: 0.10987
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## btu_type_factor owner_factor group_owner_factor
## Min. :-0.77538 Min. :-0.19033 Min. :-0.28777
## 1st Qu.:-0.08517 1st Qu.:-0.12146 1st Qu.:-0.12128
## Median : 0.05902 Median :-0.04739 Median :-0.06225
## Mean : 0.15632 Mean : 0.07912 Mean : 0.01893
## 3rd Qu.: 0.37831 3rd Qu.: 0.10620 3rd Qu.: 0.05656
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## btu_platform_factor dp_location_factor closed_by_factor
## Min. :-0.79164 Min. :-0.79164 Min. :-0.19033
## 1st Qu.:-0.08959 1st Qu.:-0.34350 1st Qu.:-0.12146
## Median : 0.07112 Median : 0.01833 Median :-0.04739
## Mean : 0.15262 Mean :-0.10218 Mean : 0.07912
## 3rd Qu.: 0.39813 3rd Qu.: 0.12603 3rd Qu.: 0.10620
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## fdp_device_name_factor fdp_site_name_factor olt_site_name_factor
## Min. :-0.65681 Min. :-0.65681 Min. :-0.622802
## 1st Qu.:-0.11000 1st Qu.:-0.11000 1st Qu.:-0.107883
## Median :-0.02258 Median :-0.02258 Median :-0.004825
## Mean : 0.17081 Mean : 0.17081 Mean : 0.169093
## 3rd Qu.: 0.55666 3rd Qu.: 0.55666 3rd Qu.: 0.566626
## Max. : 1.00000 Max. : 1.00000 Max. : 1.000000
## exchange_factor
## Min. :-0.376960
## 1st Qu.:-0.071447
## Median : 0.002042
## Mean : 0.159936
## 3rd Qu.: 0.333979
## Max. : 1.000000
Plot the correlation matrix.
png(height=1200, width=1200, pointsize=15, file="corrplot.png")
corrplot(correlations, method = "number",tl.cex = 0.9 ,addCoef.col="grey", order = "AOE")
dev.off()
## png
## 2
Feature selection via GBM (stochastic gradient boosting) to confirm which variables act as predictors, with the resolution code as the dependent variable.
A list of other available models: http://topepo.github.io/caret/modelList.html
set.seed(777)
suppressWarnings(suppressMessages(library(mlbench)))
control <- trainControl(method = "repeatedcv", number = 5)
model <- train(
  resolution_code_factor ~ ., data = df, method = "gbm",
  preProcess = "scale", trControl = control, verbose = FALSE
)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.2.2
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:graph':
##
## join
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
importance <- varImp(model, scale = TRUE)
print(importance)
## gbm variable importance
##
## only 20 most important variables shown (out of 21)
##
## Overall
## cause_code_factor 100.0000
## cause_category_factor 88.4202
## resolution_team_factor 42.1766
## fdp_device_name_factor 29.1509
## owner_factor 22.4480
## service_affected_factor 12.3133
## created_by_factor 10.8777
## btu_type_factor 9.3577
## group_owner_factor 7.5313
## dp_location_factor 4.9879
## btu_platform_factor 3.4670
## closure_category_factor 3.2488
## sub_product_factor 0.6527
## olt_site_name_factor 0.0000
## closed_by_factor 0.0000
## exchange_factor 0.0000
## package_name_factor 0.0000
## category_factor 0.0000
## symptom_error_code_factor 0.0000
## product_factor 0.0000
plot(importance)
So far, the main variables or factors found are those ranked in the importance plot above, obtained after the tuning and model-fitting process on trouble tickets with status = 'Closed'.
Still need to discuss this. Why do we need this dataset?
I still do not know the best time to train on the dataset. The dataset keeps updating every day when :-
I recommend retraining on the dataset every week because of its size and velocity.
The Bayes model might need to be rebuilt if the prediction level falls below a threshold, i.e. < 60% accuracy.
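The retraining rule above can be sketched as a simple check; the function name and the way accuracy is obtained (e.g. from `confusionMatrix` on fresh tickets) are illustrative assumptions:

```r
# Rebuild trigger for the Bayes model: retrain when prediction accuracy
# drops below the 60% threshold (accuracy would come from evaluating the
# current model on fresh trouble tickets, e.g. via confusionMatrix()).
should_retrain <- function(current_accuracy, threshold = 0.60) {
  current_accuracy < threshold
}

should_retrain(0.55)  # below threshold -> rebuild the model
should_retrain(0.82)  # healthy -> keep the weekly schedule
```

The weekly job would evaluate the current model first and only rebuild when this check fires, which avoids retraining cost when prediction quality is still acceptable.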
Basically, below is how I acquired the dataset, as mentioned during the previous proposal defense :-
And this is how the implementation will work in the future :-
Anyway, this is the basic procedure for training the sample dataset (Kepong Zone only) using NaiveBayes and predicting the resolution code :-
# convert every column into a factor, as required by the Bayes algorithm
col_names <- names(df)
df[, col_names] <- lapply(df[, col_names], factor)
names(df)
## [1] "created_by_factor" "category_factor"
## [3] "symptom_error_code_factor" "product_factor"
## [5] "sub_product_factor" "package_name_factor"
## [7] "cause_category_factor" "cause_code_factor"
## [9] "resolution_code_factor" "closure_category_factor"
## [11] "resolution_team_factor" "service_affected_factor"
## [13] "btu_type_factor" "owner_factor"
## [15] "group_owner_factor" "btu_platform_factor"
## [17] "dp_location_factor" "closed_by_factor"
## [19] "fdp_device_name_factor" "fdp_site_name_factor"
## [21] "olt_site_name_factor" "exchange_factor"
# split 70/30 from dataset into training and testing
trainIndex <- createDataPartition(df$resolution_code_factor, p=0.70, list=FALSE)
## Warning in createDataPartition(df$resolution_code_factor, p = 0.7, list
## = FALSE): Some classes have a single record ( 2, 4, 6, 7, 8, 9, 13 ) and
## these will be selected for the sample
data_train <- df[ trainIndex,]
data_test <- df[-trainIndex,]
# train a naive bayes model
model <- NaiveBayes(as.factor(resolution_code_factor)~., data=data_train)
# make predictions
predictions <- suppressWarnings(suppressMessages(predict(model, data_test)))
# summarize results by predicting resolution code using the test dataset -> data_test
cm <- suppressWarnings(suppressMessages(confusionMatrix(predictions$class, data_test$resolution_code_factor)))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1 14 0 0 0 0 0 0 0 0 1 1 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 8 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0
## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
##
## Overall Statistics
##
## Accuracy : 0.8261
## 95% CI : (0.6122, 0.9505)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.05753
##
## Kappa : 0.6567
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.9333 NA NA NA NA NA
## Specificity 0.7500 1 1 1 1 1
## Pos Pred Value 0.8750 NA NA NA NA NA
## Neg Pred Value 0.8571 NA NA NA NA NA
## Prevalence 0.6522 0 0 0 0 0
## Detection Rate 0.6087 0 0 0 0 0
## Detection Prevalence 0.6957 0 0 0 0 0
## Balanced Accuracy 0.8417 NA NA NA NA NA
## Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity NA NA NA 0.00000 0.00000
## Specificity 1 0.95652 0.95652 1.00000 1.00000
## Pos Pred Value NA NA NA NaN NaN
## Neg Pred Value NA NA NA 0.95652 0.95652
## Prevalence 0 0.00000 0.00000 0.04348 0.04348
## Detection Rate 0 0.00000 0.00000 0.00000 0.00000
## Detection Prevalence 0 0.04348 0.04348 0.00000 0.00000
## Balanced Accuracy NA NA NA 0.50000 0.50000
## Class: 12 Class: 13 Class: 14 Class: 15
## Sensitivity NA NA 0.8000 1.00000
## Specificity 1 1 1.0000 1.00000
## Pos Pred Value NA NA 1.0000 1.00000
## Neg Pred Value NA NA 0.9474 1.00000
## Prevalence 0 0 0.2174 0.04348
## Detection Rate 0 0 0.1739 0.04348
## Detection Prevalence 0 0 0.1739 0.04348
## Balanced Accuracy NA NA 0.9000 1.00000
Based on the dataset from the Kepong zone, we managed to get the prediction accuracy above 80%.
So far I am only using the 70/30 data partition; we can also consider cross-validation sampling.
Below is the option for a 70/30 partition :-
trainIndex <- createDataPartition(df$resolution_code_factor, p=0.70, list=FALSE)
Below is the option for 10-fold cross-validation, repeated 3 times :-
trainControl(method = "repeatedcv", number = 10, repeats = 3)
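For illustration, the fold structure that `trainControl(method = "repeatedcv", ...)` manages internally can be sketched in base R; this is a simplified sketch of the resampling idea, not caret's actual implementation:

```r
# Simplified illustration of repeated k-fold CV fold assignment:
# each repeat shuffles a 1..k fold label across all n rows, so every
# row is held out exactly once per repeat.
make_repeated_folds <- function(n, k = 10, repeats = 3, seed = 777) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    sample(rep(seq_len(k), length.out = n))  # fold label per row
  })
}

folds <- make_repeated_folds(n = 150, k = 10, repeats = 3)
length(folds)      # one element per repeat
table(folds[[1]])  # roughly equal fold sizes
```

Compared with a single 70/30 split, repeated CV uses every record for both training and testing, which should give a more stable accuracy estimate on a dataset this small.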
How to clean and filter the data via an SQL statement on Impala, applying the condition rules listed earlier, is mentioned here.
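A minimal sketch of assembling that Impala filter from the condition rules table; the table name `trouble_ticket` and the exact column names are assumptions:

```r
# Assemble the Impala cleaning query from the condition rules:
# closed tickets only, not linked to a Network TT, non-PASSIVE type,
# and the installed/created/closed date fields must be populated.
build_filter_sql <- function(table = "trouble_ticket") {  # table name assumed
  paste0(
    "SELECT * FROM ", table,
    " WHERE status = 'Closed'",
    " AND network_tt_id IS NULL",
    " AND trouble_ticket_type <> 'PASSIVE'",
    " AND installed_date IS NOT NULL",
    " AND created_date IS NOT NULL",
    " AND closed_date IS NOT NULL"
  )
}

sql <- build_filter_sql()
# sql would then be sent to the Impala connection, e.g. DBI::dbGetQuery(con, sql)
```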
The sample dataset has been converted into factors and applied here.
Below is the sample dataset after conversion into factors, which is required by the Bayes algorithm :-
head(df)
## created_by_factor category_factor symptom_error_code_factor
## 1 48 1 7
## 2 22 2 3
## 3 29 1 7
## 4 72 1 7
## 5 14 1 7
## 6 66 1 7
## product_factor sub_product_factor package_name_factor
## 1 2 19 8
## 2 3 10 6
## 3 1 7 3
## 4 3 12 8
## 5 3 12 8
## 6 3 12 8
## cause_category_factor cause_code_factor resolution_code_factor
## 1 3 5 1
## 2 3 7 8
## 3 4 9 6
## 4 8 17 11
## 5 2 22 15
## 6 5 2 13
## closure_category_factor resolution_team_factor service_affected_factor
## 1 3 2 5
## 2 3 15 1
## 3 3 15 3
## 4 1 16 1
## 5 3 15 4
## 6 3 14 3
## btu_type_factor owner_factor group_owner_factor btu_platform_factor
## 1 3 10 3 4
## 2 3 15 1 2
## 3 2 29 3 2
## 4 3 39 1 4
## 5 3 30 1 4
## 6 3 14 1 4
## dp_location_factor closed_by_factor fdp_device_name_factor
## 1 1 10 59
## 2 1 15 63
## 3 5 29 40
## 4 2 39 52
## 5 1 30 30
## 6 1 14 54
## fdp_site_name_factor olt_site_name_factor exchange_factor
## 1 59 21 2
## 2 63 24 2
## 3 40 11 2
## 4 52 19 2
## 5 30 10 1
## 6 54 21 2
The selection of the independent variables via the ranked importance of the factors has been done and is mentioned here.
I am still working on the other methods such as Random Forest, GBM and RPart. We need to discuss this.
Feature selection via Generalized Boosted Regression has been tried before, as shown here.
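Selecting the top-ranked predictors from the importance table can be sketched as below; the `imp` data frame copies a few rows from the `varImp` output printed earlier, and the cut-off `k` is illustrative:

```r
# A few rows copied from the varImp() table above, used as a stand-in
# for the full importance object.
imp <- data.frame(
  Overall = c(100.0000, 88.4202, 42.1766, 29.1509, 22.4480),
  row.names = c("cause_code_factor", "cause_category_factor",
                "resolution_team_factor", "fdp_device_name_factor",
                "owner_factor")
)

# keep the k highest-ranked predictors
top_features <- function(imp, k = 3) {
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][seq_len(k)]
}

top_features(imp, 3)
```

The returned names can then be used to subset `df` before training, dropping the zero-importance variables at the bottom of the ranking.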
I am thinking of clustering the cause codes & the resolution codes because the unique count of each of them is around
I also have some references on the categories of the cause code. So far the cause code can be divided into :-
I chose only this trouble ticket type because most Trouble Tickets are related to Active element types. The clustering of the cause codes & resolution codes is therefore needed to reduce the size and make them more manageable, so that we can find the relationships among the variables.
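One possible way to make the high-cardinality codes more manageable is to collapse rare levels into a single "Other" level; this is only an illustrative sketch, not the final grouping scheme, which is still to be decided:

```r
# Collapse factor levels occurring fewer than min_count times into "Other",
# shrinking the level set of high-cardinality codes.
collapse_rare_levels <- function(x, min_count = 2, other = "Other") {
  x <- as.factor(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  levels(x)[levels(x) %in% rare] <- other  # duplicate level names are merged
  x
}

codes <- factor(c("A", "A", "B", "B", "C", "D"))
collapse_rare_levels(codes)  # C and D fold into "Other"
```

A frequency cut-off like this also helps the Naive Bayes model, since single-record classes (as flagged in the `createDataPartition` warning above) cannot be split sensibly into training and testing sets.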
Need to discuss, and I will do this later
Need to discuss on this
Need to discuss on this
# workflow <- print(grViz("
# digraph neato {
#
#
# ratio = \"fill\";
# size=\"8.3,11.7!\";
# margin=0;
#
# # graph attributes
# graph [overlap = true, fontsize = 10]
#
# node [shape = box,style = filled,
# fontname = Arial,
# color = darkslategray , fillcolor = orange]
#
# # edge statements
# edge [color = grey]
# GET_Zone_Distinct_List -> RETURN_Zone_List -> LOOP_Zone_List -> GENERATE_SQL_filter_by_zone; DESIGN_Base_SQL ->
# APPLY_record_filter -> RECEIVE_Zone_Paramater -> GENERATE_SQL_filter_by_zone ->
# ORDER_records_in_random_mode -> Limit_records_per_zone -> Download_dataset -> Load_in_R->COVERT_dataset_as_dataframe->REMOVE_unrelated_fields->FETCH_sample_dataset->Split_70_30_for_training_and_testing->APPLY_feature_extraction->GET_related_field_for_training->APPLY_training_from_dataset->BUILD_model_from_training;
# Download_dataset-> SAVE_Trouble_Ticket_No_As_Reference;
# Download_dataset-> SAVE_Description_Text_As_Reference;
# }
# "),engine = "neato")
# c<- grVizOutput(workflow)
The commented-out example above shows the process workflow.
As previously mentioned here, this is done by applying the SQL statement and the condition rules.
During my last proposal defense, I proposed to develop the dictionary vector as an enrichment source for trouble ticket prediction and also to identify the category of the messages.
I have written some simple R code to demonstrate this :-
df_tm <- data.frame(df_tm$description , stringsAsFactors = TRUE)
mycorpus <- Corpus(DataframeSource(df_tm))
# remove all unrelated char/text
# NOTE: the correct control option name is `weighting`; `weight` is ignored
# by TermDocumentMatrix, so the matrix below uses plain term frequency (tf).
tdm <- TermDocumentMatrix(mycorpus, control = list(weight = weightTfIdf,
    removePunctuation = TRUE, stopwords = TRUE, removeNumbers = TRUE, stemming = FALSE))
# remove sparse terms
tdm <- removeSparseTerms(tdm, sparse = 0.8)
# Generate the dictionary vector, limited to the first 10 documents
c <- suppressWarnings(suppressMessages(inspect(tdm[,1:10])))
## <<TermDocumentMatrix (terms: 124, documents: 10)>>
## Non-/sparse entries: 769/471
## Sparsity : 38%
## Maximal term length: 20
## Weighting : term frequency (tf)
##
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10
## acc 0 0 1 0 1 0 0 0 0 0
## account 1 1 2 1 2 4 4 2 3 1
## acct 1 1 1 2 1 1 1 1 1 1
## acknowledge 0 0 0 0 0 0 0 0 1 0
## active 1 1 1 1 1 1 1 1 1 0
## address 3 2 3 4 2 2 2 2 2 2
## addressing 1 1 1 1 1 1 1 1 1 0
## advice 1 3 0 3 2 1 1 1 3 2
## advise 0 0 0 0 0 0 0 0 3 0
## advised 0 0 0 0 1 1 0 2 0 0
## alive 1 0 0 0 0 0 0 0 0 1
## already 2 1 0 0 0 0 0 0 0 0
## also 0 0 0 0 1 0 0 0 0 0
## assigned 1 1 1 1 1 1 1 1 1 0
## available 0 0 1 2 1 1 0 1 1 0
## back 0 2 0 0 0 0 1 0 1 0
## bin 0 0 0 2 0 0 1 0 2 0
## bras 1 1 1 2 1 1 1 1 1 1
## bsrbatunifige 0 2 0 0 0 0 3 0 0 0
## btu 1 0 1 1 2 0 1 0 1 0
## cable 1 0 1 1 1 0 1 0 0 0
## call 0 3 1 0 0 1 0 0 1 1
## called 0 0 0 0 0 0 0 1 0 0
## caller 0 1 1 1 1 1 1 1 1 1
## calling 1 1 1 2 1 1 1 2 1 1
## check 2 0 5 3 3 5 5 1 3 3
## checked 0 0 0 0 0 0 0 2 0 0
## connect 1 1 1 2 1 1 1 1 1 1
## connection 0 1 1 2 1 0 0 2 0 1
## contact 0 1 1 1 1 1 1 1 1 1
## cpe 0 0 0 0 1 0 0 0 0 0
## create 2 2 2 1 1 1 0 2 0 0
## created 1 1 1 1 1 1 2 1 2 0
## ctt 2 1 1 2 1 1 1 0 0 1
## cust 3 0 0 5 0 1 0 0 3 0
## customer 2 10 2 7 7 4 2 5 7 4
## date 2 3 2 2 2 2 2 2 2 0
## detail 1 0 0 1 0 0 0 0 1 1
## details 1 1 0 2 1 1 0 0 2 0
## dlink 1 0 1 2 1 0 1 1 1 1
## done 1 0 2 0 0 1 0 0 0 3
## download 1 1 1 1 1 1 1 1 1 0
## dsl 1 0 0 0 1 0 0 0 1 1
## dynamically 1 1 1 1 1 1 1 0 1 0
## english 0 1 0 0 0 1 0 1 0 0
## equipment 0 0 0 0 0 0 0 0 0 0
## eth 1 2 0 0 0 3 0 0 0 1
## exchange 1 1 1 1 1 1 1 1 1 1
## framed 1 1 1 2 1 1 1 1 1 1
## full 0 1 0 0 0 0 0 0 0 0
## get 0 0 0 0 0 0 0 1 0 0
## holder 0 0 0 0 1 0 1 0 0 1
## hsi 2 0 2 2 1 1 1 1 3 2
## huaweiqosprofilename 1 1 1 1 1 1 1 1 1 0
## huaweiuserclass 1 1 1 1 1 1 1 1 1 0
## inform 1 3 0 1 1 0 0 0 0 0
## informed 0 0 0 0 0 0 0 1 0 0
## internet 2 1 1 1 1 0 1 2 2 2
## iptv 2 0 1 1 1 0 5 1 2 1
## jan 0 3 0 0 0 6 4 2 0 0
## lan 2 0 8 0 4 3 4 8 5 5
## language 1 1 1 0 1 1 1 1 1 1
## light 0 0 2 2 1 0 0 0 1 0
## login 1 1 1 1 1 1 1 1 1 0
## mac 2 1 2 2 1 1 1 2 1 1
## malay 1 0 1 0 0 0 1 0 1 1
## modem 0 0 1 1 1 0 0 1 1 0
## modified 1 1 1 1 1 1 1 1 1 0
## name 1 3 3 2 2 2 3 3 3 1
## number 2 1 2 2 2 0 2 3 3 1
## onu 0 0 1 0 0 0 0 1 0 0
## orange 0 0 1 0 1 0 0 0 0 0
## order 1 1 0 1 1 1 1 1 1 1
## outages 0 0 0 1 1 1 1 1 1 1
## outcome 1 1 1 0 0 1 1 0 1 1
## owner 1 1 0 1 0 1 1 0 1 0
## package 1 1 1 0 1 1 1 0 1 1
## persist 2 0 0 1 1 0 0 2 1 0
## phone 0 0 1 1 0 1 1 0 0 1
## ping 1 1 1 2 1 1 1 1 1 1
## power 4 0 2 0 2 3 2 2 2 3
## pppoe 1 1 1 1 1 1 1 0 1 0
## preferred 0 1 1 0 0 1 1 0 0 1
## problem 3 2 2 2 2 1 2 3 2 1
## proceed 0 2 0 1 0 1 0 1 0 0
## profile 1 1 1 1 1 1 1 1 1 0
## query 2 1 1 3 1 2 1 1 2 1
## rad 1 1 1 1 0 1 1 1 0 0
## raduser 1 1 1 1 1 1 0 1 1 0
## reboot 1 0 4 2 1 0 1 1 2 3
## result 3 1 2 6 2 3 1 2 3 1
## scheme 1 1 1 1 1 1 1 1 1 0
## sec 0 2 0 0 0 3 2 1 0 0
## service 0 0 0 0 1 0 0 0 0 1
## services 0 0 0 0 1 1 0 0 1 1
## session 3 4 3 5 3 4 5 3 3 4
## sgb 2 3 1 1 0 4 0 0 1 0
## since 0 1 0 0 0 1 0 0 0 0
## soc 0 1 0 1 1 0 0 1 0 1
## speed 1 1 1 1 1 1 1 1 1 0
## station 1 1 1 2 1 1 1 2 1 1
## status 2 1 3 1 2 3 2 1 2 0
## stb 3 0 0 0 0 0 3 0 0 0
## still 1 1 0 2 1 0 0 0 1 0
## streamyx 2 1 1 3 1 2 1 1 2 1
## system 0 2 0 0 0 0 0 1 0 0
## time 2 4 3 4 2 2 3 2 3 2
## total 1 1 1 2 1 1 1 1 1 1
## troubleshooting 0 2 1 1 0 1 0 0 1 0
## unable 0 0 0 1 0 0 0 1 0 0
## upload 1 1 1 1 1 1 1 1 1 0
## usage 1 1 1 1 0 1 1 1 1 0
## user 1 0 0 2 0 1 0 0 1 0
## username 1 1 1 1 1 1 1 2 1 1
## using 0 0 0 1 0 0 0 0 0 0
## vdsl 1 0 0 0 1 0 0 0 1 1
## verify 0 0 0 2 2 0 0 0 1 0
## vip 1 1 0 0 1 1 1 0 1 0
## vipm 1 1 0 1 1 1 1 0 1 0
## vobb 2 1 3 2 2 2 2 2 2 2
## voip 1 0 1 0 0 0 0 1 1 1
## will 0 1 0 0 0 0 0 0 0 0
## wireless 1 0 1 0 1 0 1 1 1 1
## xpon 0 0 1 0 0 0 3 1 0 0
# Plotting the relationship between words
plot(tdm, corThreshold = 0.2, weighting = TRUE)
So this dictionary vector can be used in the future to add new variables onto the current dataset.
# disable this code due to lots of messages appears
# transpose_tdm<- invisible(as.data.frame(inspect(t(tdm[,]))))
# disable this code due to lots of messages appears
# cause_code_tdm <- cbind(cbind(df$tt_row_id,df$cause_code),transpose_tdm[1:10,1:10])
# write to CSV
# write.csv(cause_code_tdm,"transpose_tdm.csv")
transpose_tdm <- read.csv("transpose_tdm.csv")
head(transpose_tdm)
## X trouble_ticket_id Cause_code acc account acct
## 1 1 1-WTJASD Clear While Localising 0 1 1
## 2 2 1-WTJ924 Customer_Behaviour 0 1 1
## 3 3 1-WTJPNC Drop Fiber Damage 1 2 1
## 4 4 1-WTGIQE TM_CPE_PG Config Problem 0 1 2
## 5 5 1-WTI5UQ VDSL Modem Faulty 1 2 1
## 6 6 1-WTDQ9Q AccessNetwork_Failure 0 4 1
## acknowledge active address addressing advice advise advised
## 1 0 1 3 1 1 0 0
## 2 0 1 2 1 3 0 0
## 3 0 1 3 1 0 0 0
## 4 0 1 4 1 3 0 0
## 5 0 1 2 1 2 0 1
## 6 0 1 2 1 1 0 1
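The enrichment step could then join these term counts back onto the main dataset by ticket id; this sketch uses tiny mock frames mirroring the columns shown above, and assumes a `trouble_ticket_id` key exists in the main dataset:

```r
# Mock main dataset and mock term counts (column names follow the
# transpose_tdm output above; trouble_ticket_id as the join key is assumed).
tickets <- data.frame(trouble_ticket_id = c("1-WTJASD", "1-WTJ924"),
                      resolution_code   = c(1, 8))
terms   <- data.frame(trouble_ticket_id = c("1-WTJ924", "1-WTJASD"),
                      account = c(1, 1), address = c(2, 3))

# each matched ticket gains the term-count columns as new variables
enriched <- merge(tickets, terms, by = "trouble_ticket_id")
enriched
```

After the merge, the term columns sit alongside the existing factors and could be offered to the Bayes model as additional predictors.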
We need to discuss this further.
Still in progress.
Submission date ?
Still in progress. Need to discuss.
Submission date ?
Still in progress. Need to discuss.
Submission date ?